Incremental Learning Algorithm for Dynamic Data Streams
Author
Abstract
Recent advances in hardware and software have enabled the capture of measurements in a wide range of fields. These measurements are generated continuously and at very high, fluctuating data rates. Examples include sensor networks, web logs, and computer network traffic. Storing, querying, and mining such data sets are computationally challenging tasks. Data stream mining is concerned with extracting knowledge structures, represented as models and patterns, from non-stop streams of information. Research in data stream mining has attracted considerable attention because of the importance of its applications and the growing volume of streaming information. Applications of data stream analysis range from critical scientific and astronomical applications to important business and financial ones. Algorithms, systems, and frameworks that address streaming challenges have been developed over the past few years. This paper presents a system for inducing a forest of functional trees from data streams that is able to detect concept drift. The Ultra Fast Forest of Trees (UFFT) is an incremental algorithm that works online, processing each example in constant time and performing a single scan over the training examples. It uses analytical techniques to choose the splitting criteria and information gain to estimate the merit of each possible splitting test. For multi-class problems, the algorithm builds a binary tree for each possible pair of classes, leading to a forest of trees. Decision nodes and leaves contain naive Bayes classifiers, which play different roles during the induction process. Naive Bayes classifiers in leaves are used to classify test examples. Naive Bayes classifiers in inner nodes play two different roles: they can be used as multivariate splitting tests if chosen by the splitting criteria, and they are used to detect changes in the class distribution of the examples that traverse the node.
When a change in the class distribution is detected, the entire subtree rooted at that node is pruned. The naive Bayes classifiers used at leaves to classify test examples, the splitting tests based on the outcome of naive Bayes, and the naive Bayes classifiers used at decision nodes to detect changes in the distribution of the examples are all obtained directly from the sufficient statistics required to compute the splitting criteria, without any additional computation. This is a major advantage in the context of high-speed data streams. The methodology was tested on artificial and real-world data sets. The experimental results show very good performance in comparison to a batch decision tree learner and a high capacity to detect drift in the distribution of the examples.
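The following Python sketch illustrates two ideas from the abstract: per-class sufficient statistics that are updated in constant time per example and reused by a naive Bayes classifier, and the pairwise class decomposition behind the forest. It is only an illustration, not the authors' implementation. The names `RunningGaussian`, `StreamingNaiveBayes`, and `class_pairs` are hypothetical, and the Gaussian model for numeric attributes is an assumption made here. Tree growth, splitting tests, and drift detection are not shown.

```python
import math
from collections import defaultdict
from itertools import combinations


class RunningGaussian:
    """Count, mean, and sum of squared deviations for one attribute."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        # Welford's update: constant time and memory per value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def log_pdf(self, x):
        # Assumption: attribute values are Gaussian within each class.
        # A small variance floor avoids division by zero early in the stream.
        var = max(self.m2 / self.n, 1e-9) if self.n > 0 else 1.0
        return -0.5 * (math.log(2 * math.pi * var) + (x - self.mean) ** 2 / var)


class StreamingNaiveBayes:
    """Naive Bayes driven only by incrementally maintained statistics."""

    def __init__(self, n_attributes):
        self.class_counts = defaultdict(int)
        self.stats = defaultdict(
            lambda: [RunningGaussian() for _ in range(n_attributes)]
        )

    def learn_one(self, x, y):
        # Single pass: each example is seen once and then discarded.
        self.class_counts[y] += 1
        for value, gaussian in zip(x, self.stats[y]):
            gaussian.update(value)

    def predict_one(self, x):
        total = sum(self.class_counts.values())

        def score(label):
            log_prior = math.log(self.class_counts[label] / total)
            log_likelihood = sum(
                g.log_pdf(v) for v, g in zip(x, self.stats[label])
            )
            return log_prior + log_likelihood

        return max(self.class_counts, key=score)


def class_pairs(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(sorted(classes), 2))


if __name__ == "__main__":
    model = StreamingNaiveBayes(n_attributes=2)
    stream = [
        ([0.0, 1.0], "a"),
        ([0.2, 0.8], "a"),
        ([10.0, 9.0], "b"),
        ([9.8, 9.4], "b"),
    ]
    for x, y in stream:
        model.learn_one(x, y)
    print(model.predict_one([0.1, 0.9]))
    print(class_pairs(["a", "b", "c"]))
```

With k classes, `class_pairs` yields k(k-1)/2 pairs, so a three-class problem would use three binary trees under this decomposition.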
Similar resources
Info-fuzzy algorithms for mining dynamic data streams
Most data mining algorithms assume static behavior of the incoming data. In the real world, the situation is different and most continuously collected data streams are generated by dynamic processes, which may change over time, in some cases even drastically. The change in the underlying concept, also known as concept drift, causes the data mining model generated from past examples to become le...
Incremental Granular Fuzzy Modeling Using Imprecise Data Streams
System modeling in dynamic environments needs processing of streams of sensor data and incremental learning algorithms. This paper suggests an incremental granular fuzzy rule-based modeling approach using streams of fuzzy interval data. Incremental granular modeling is an adaptive modeling framework that uses fuzzy granular data that originate from unreliable sensors, imprecise perceptio...
An Efficient Incremental Algorithm to Mine Closed Frequent Itemsets over Data Streams
The purpose of this work is to mine closed frequent itemsets from transactional data streams using a sliding window model. An efficient algorithm, IMCFI, is proposed for Incremental Mining of Closed Frequent Itemsets from a transactional data stream. The proposed algorithm IMCFI uses a data structure called INdexed Tree (INT), similar to NewCET used in NewMoment [5]. INT contains an index table Item...
An Empirical Comparison of Bayesian Network Parameter Learning Algorithms for Continuous Data Streams
We compare three approaches to learning numerical parameters of Bayesian networks from continuous data streams: (1) the EM algorithm applied to all data, (2) the EM algorithm applied to data increments, and (3) the online EM algorithm. Our results show that learning from all data at each step, whenever feasible, leads to the highest parameter accuracy and model classification accuracy. When fac...
Dynamic Weighted Majority for Incremental Learning of Imbalanced Data Streams with Concept Drift
Concept drifts occurring in data streams will jeopardize the accuracy and stability of the online learning process. If the data stream is imbalanced, it will be even more challenging to detect and cure the concept drift. In the literature, these two problems have been intensively addressed separately, but have yet to be well studied when they occur together. In this paper, we propose a chunk-ba...
Comparative Study of Incremental Learning Algorithms in Multidimensional Outlier Detection on Data Stream
Multi-dimensional outlier detection (MOD) over data streams is one of the most significant data stream mining techniques. When multivariate data are streaming at high speed, outliers must be detected efficiently and accurately. Conventional outlier detection methods are based on observing the full dataset and its statistical distribution. The data are assumed to be stationary. However, this convention...